AITopics | vision-language representation

Interpreta view of the lighthouseandsky person works at his desk in officedifferent concepts(a)(b)(c)Vision RepresentationLanguage RepresentationConcept Activationthe same concept

Neural Information Processing SystemsJun-16-2026, 18:26:04 GMT

However, the interpretability of the alignment component remains uninvestigated due to the difficulty in mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer correlates to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity. Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing the vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts. For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination.

large language model, machine learning, natural language, (22 more...)

Neural Information Processing Systems

Country:

Europe (1.00)
Asia (0.67)
North America > United States > California (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.92)

Industry: Leisure & Entertainment > Sports (0.67)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents

Neural Information Processing SystemsJun-12-2026, 14:20:41 GMT

Pre-training vision-language representations on human action videos has emerged as a promising approach to reduce reliance on large-scale expert demonstrations for training embodied agents. However, prior methods often employ time contrastive learning based on goal-reaching heuristics, progressively aligning language instructions from the initial to the final frame. This overemphasis on future frames can result in erroneous vision-language associations, as actions may terminate early or include irrelevant moments in the end. To address this issue, we propose Action Temporal Coherence Learning (AcTOL) to learn ordered and continuous vision-language representations without rigid goal-based constraint. AcTOL treats a video as a continuous trajectory where it (1) contrasts semantic differences between frames to reflect their natural ordering, and (2) imposes a local Brownian bridge constraint to ensure smooth transitions across intermediate frames. Extensive imitation learning experiments on both simulated and real robots show that the pretrained features significantly enhance downstream manipulation tasks with high robustness to different linguistic styles of instructions, offering a viable pathway toward generalized embodied agents. Our project page is at https://actol-pretrain.github.io/.

artificial intelligence, machine learning, proceedings, (7 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Robots (0.97)

Add feedback

VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

Neural Information Processing SystemsJun-11-2026, 21:25:45 GMT

However, the interpretability of the alignment component remains uninvestigated due to the difficulty in mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in the hidden layer correlates to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity.

artificial intelligence, machine learning, natural language, (13 more...)

Neural Information Processing Systems

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.59)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.59)

Add feedback

Delving into Out-of-Distribution Detection with Vision-Language Representations

Neural Information Processing SystemsDec-25-2025, 13:26:21 GMT

Recognizing out-of-distribution (OOD) samples is critical for machine learning systems deployed in the open world. The vast majority of OOD detection methods are driven by a single modality (e.g., either vision or language), leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of OOD detection from a single-modal to a multi-modal regime. Particularly, we propose Maximum Concept Matching (MCM), a simple yet effective zero-shot OOD detection method based on aligning visual features with textual concepts. We contribute in-depth analysis and theoretical insights to understand the effectiveness of MCM. Extensive experiments demonstrate that MCM achieves superior performance on a wide variety of real-world tasks.

name change, out-of-distribution detection, vision-language representation, (5 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

Shen, Shufan, Sun, Junshu, Huang, Qingming, Wang, Shuhui

arXiv.org Artificial IntelligenceOct-27-2025

The alignment of vision-language representations endows current Vision-Language Models (VLMs) with strong multi-modal reasoning capabilities. However, the interpretability of the alignment component remains uninvestigated due to the difficulty in mapping the semantics of multi-modal representations into a unified concept set. To address this problem, we propose VL-SAE, a sparse autoencoder that encodes vision-language representations into its hidden activations. Each neuron in its hidden layer correlates to a concept represented by semantically similar images and texts, thereby interpreting these representations with a unified concept set. To establish the neuron-concept correlation, we encourage semantically similar representations to exhibit consistent neuron activations during self-supervised training. First, to measure the semantic similarity of multi-modal representations, we perform their alignment in an explicit form based on cosine similarity. Second, we construct the VL-SAE with a distance-based encoder and two modality-specific decoders to ensure the activation consistency of semantically similar representations. Experiments across multiple VLMs (e.g., CLIP, LLaVA) demonstrate the superior capability of VL-SAE in interpreting and enhancing the vision-language alignment. For interpretation, the alignment between vision and language representations can be understood by comparing their semantics with concepts. For enhancement, the alignment can be strengthened by aligning vision-language representations at the concept level, contributing to performance improvements in downstream tasks, including zero-shot image classification and hallucination elimination. Codes are available at https://github.com/ssfgunner/VL-SAE.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2510.21323

Country:

Europe (1.00)
Asia (0.67)
North America > United States > California (0.28)

Genre:

Research Report > Experimental Study (1.00)
Research Report > New Finding (0.93)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.89)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.67)

Add feedback

the visual observation with a fixed, randomly initialized target network [Random Network Distillation;

Neural Information Processing SystemsAug-17-2025, 08:24:30 GMT

Effective exploration is a challenge in reinforcement learning (RL).

machine learning, natural language, reinforcement learning, (17 more...)

Neural Information Processing Systems

Country: Europe > United Kingdom > England > Greater London > London (0.05)

Genre: Research Report > New Finding (0.68)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Vision (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.67)

Add feedback

Multilingual Diversity Improves Vision-Language Representations

Neural Information Processing SystemsAug-13-2025, 06:03:26 GMT

Massive web-crawled image-text datasets lay the foundation for recent progress in multimodal learning. These datasets are designed with the goal of training a model to do well on standard computer vision benchmarks, many of which, however, have been shown to be English-centric (e.g., ImageNet). Consequently, existing data curation techniques gravitate towards using predominantly English image-text pairs and discard many potentially useful non-English samples. Multilingual data is inherently enriching not only because it provides a gateway to learn about culturally salient concepts, but also because it depicts common concepts differently from monolingual data. We thus conduct a systematic study to explore the performance benefits of using more samples of non-English origins with respect to English vision tasks.

artificial intelligence, machine learning, vision-language representation, (5 more...)

Neural Information Processing Systems

Country: Africa (0.07)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.78)
Information Technology > Artificial Intelligence > Vision (0.60)

Add feedback

Delving into Out-of-Distribution Detection with Vision-Language Representations

Neural Information Processing SystemsJan-19-2025, 04:00:20 GMT

Recognizing out-of-distribution (OOD) samples is critical for machine learning systems deployed in the open world. The vast majority of OOD detection methods are driven by a single modality (e.g., either vision or language), leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of OOD detection from a single-modal to a multi-modal regime. Particularly, we propose Maximum Concept Matching (MCM), a simple yet effective zero-shot OOD detection method based on aligning visual features with textual concepts. We contribute in-depth analysis and theoretical insights to understand the effectiveness of MCM.

ood detection method, out-of-distribution detection, vision-language representation, (3 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

LIV: Language-Image Representations and Rewards for Robotic Control

Ma, Yecheng Jason, Liang, William, Som, Vaidehi, Kumar, Vikash, Zhang, Amy, Bastani, Osbert, Jayaraman, Dinesh

arXiv.org Artificial IntelligenceJun-1-2023

We present Language-Image Value learning (LIV), a unified objective for vision-language representation and reward learning from action-free videos with text annotations. Exploiting a novel connection between dual reinforcement learning and mutual information contrastive learning, the LIV objective trains a multi-modal representation that implicitly encodes a universal value function for tasks specified as language or image goals. We use LIV to pre-train the first control-centric vision-language representation from large human video datasets such as EpicKitchen. Given only a language or image goal, the pre-trained LIV model can assign dense rewards to each frame in videos of unseen robots or humans attempting that task in unseen environments. Further, when some target domain-specific data is available, the same objective can be used to fine-tune and improve LIV and even other pre-trained representations for robotic control and reward specification in that domain. In our experiments on several simulated and real-world robot environments, LIV models consistently outperform the best prior input state representations for imitation learning, as well as reward specification methods for policy synthesis. Our results validate the advantages of joint vision-language representation and reward learning within the unified, compact LIV framework.

artificial intelligence, liv, representation, (16 more...)

arXiv.org Artificial Intelligence

2306.00958

Country: North America > United States > Pennsylvania (0.04)

Genre: Research Report > New Finding (0.86)

Technology: Information Technology > Artificial Intelligence > Robots (1.00)

Add feedback

Filters

Collaborating Authors

vision-language representation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Interpreta view of the lighthouseandsky person works at his desk in officedifferent concepts(a)(b)(c)Vision RepresentationLanguage RepresentationConcept Activationthe same concept

Provable Ordering and Continuity in Vision-Language Pretraining for Generalizable Embodied Agents

VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

Delving into Out-of-Distribution Detection with Vision-Language Representations

VL-SAE: Interpreting and Enhancing Vision-Language Alignment with a Unified Concept Set

the visual observation with a fixed, randomly initialized target network [Random Network Distillation;

Multilingual Diversity Improves Vision-Language Representations

Delving into Out-of-Distribution Detection with Vision-Language Representations

LIV: Language-Image Representations and Rewards for Robotic Control